Abstract

Least  squares  projections  are  a  useful  way  of  describing  the  relationship  between 
random  variables.     These  include  conditional  expectations  and  projections  on  additive 
functions.     Series  estimators,  i.e.  regressions  on  a  finite  dimensional  vector  where 
dimension  grows  with  sample  size,  provide  a  convenient  way  of  estimating  such 
projections.  This paper gives convergence rates for these estimators.  General results are
derived,  and  primitive  regularity  conditions  given  for  power  series  and  splines. 


Keywords:     Nonparametric  regression,  additive  interactive  models,  random  coefficients, 
polynomials,  splines,  convergence  rates. 


1.        Introduction 

Least  squares  projections  of  a  random  variable     y     on  functions  of  a  random  vector 
x     provide  a  useful  way  of  describing  the  relationship  between     y     and     x.     The  simplest 
example  is  linear  regression,  the  least  squares  projection  on  the  set  of  linear 
combinations  of     x,     as  exemplified  in  Rao  (1973,  Chapter  4).     An  interesting 
nonparametric example is the conditional expectation, the projection on the set of all
functions  of     x     with  finite  mean  square.     There  are  also  a  variety  of  projections  that 
fall  in  between  these  two  polar  cases,  where  the  set  of  functions  is  larger  than  all 
linear  combinations  but  smaller  than  all  functions.     One  example  is  an  additive 
regression,  the  projection  on  functions  that  are  additive  in  the  different  elements  of 
x.     This  case  is  motivated  partly  by  the  difficulty  of  estimating  conditional 
expectations  when     x     has  many  components:  see  Breiman  and  Stone  (1978),  Breiman  and 
Friedman  (1985),  Friedman  and  Stuetzle  (1981),  Stone  (1985),  and  Zeldin  and  Thomas 
(1977).     A  generalization  that  includes  some  interaction  terms  is  the  projection  on 
functions  that  are  additive  in  some  subvectors  of     x.     Another  example  is  random  linear 
combinations  of  functions  of     x,     as  suggested  by  Riedel  (1992)  for  growth  curve 
estimation. 

One  simple  way  to  estimate  nonparametric  projections  is  by  regression  on  a  finite 
dimensional  subset,  with  dimension  allowed  to  grow  with  the  sample  size,  e.g.  as  in 
Agarwal  and  Studden  (1980),  Gallant  (1981),  Stone  (1985),  Cox  (1988),  and  Andrews  (1991), 
which  will  be  referred  to  here  as  series  estimation.     This  type  of  estimator  may  not  be 
good  at  recovering  the  "fine  structure"  of  the  projection  relative  to  other  smoothers, 
e.g.  see  Buja,  Hastie,  and  Tibshirani  (1989),  but  is  computationally  simple.     Also, 
projections  often  show  up  as  nuisance  functions  in  semiparametric  estimation,  where  the 
fine  structure  is  less  important. 

This  paper  derives  convergence  rates  for  series  estimators  of  projections. 
Convergence rates are important because they show how dimension affects the asymptotic
accuracy of the estimators (e.g. Stone 1982, 1985).  Also, they are useful for the
theory  of  semiparametric  estimators  that  depend  on  projection  estimates  (e.g.   Newey 
1993a).     The  paper  gives  mean-square  rates  for  estimation  of  the  projection  and  uniform 
convergence  rates  for  estimation  of  functions  and  derivatives.     Fully  primitive 
regularity  conditions  are  given  for  power  series  and  regression  splines,  as  well  as  more 
general  conditions  that  may  apply  to  other  types  of  series. 

Previous work on convergence rates for series estimates includes Agarwal and Studden
(1980), Stone (1985, 1990), Cox (1988), and Andrews and Whang (1990).  This paper
improves on many previous results in the convergence rate or generality of regularity
conditions.     Uniform  convergence  rates  for  functions  and  their  derivatives  are  given  and 
some  of  the  results  allow  for  a  data-based  number  of  approximating  terms,  unlike  all  but 
Cox  (1988).     Also,  the  projection  does  not  have  to  equal  the  conditional  expectation,  as 
in  Stone  (1985,   1990)  but  not  the  others. 


2.        Series  Estimators 

The results of this paper concern estimators of least squares projections that can
be described as follows.  Let $z$ denote a data observation, $y$ and $x$ (measurable)
functions of $z$, with $x$ having dimension $r$.  Let $\mathcal{G}$ denote a mean-square closed,
linear subspace of the set of all functions of $x$ with finite mean-square.  The
projection of $y$ on $\mathcal{G}$ is

(2.1)  $g_0(x) = \operatorname{argmin}_{g \in \mathcal{G}} E[\{y - g(x)\}^2]$.

An example is the conditional expectation, $g_0(x) = E[y|x]$, where $\mathcal{G}$ is the set of all
measurable functions of $x$ with finite mean-square.  Two further examples will be used
as illustrations, and are of interest in their own right.


Additive-Interactive Projections:  When $x$ has more than a few distinct components it is
difficult to estimate $E[y|x]$, a feature often referred to as the "curse of
dimensionality."  This problem motivates projections that are additive in functions of
subvectors of $x$, so that the individual components have smaller dimension than $x$.  One
general way to describe these is to let $x_\ell$, $(\ell = 1, \ldots, L)$, be distinct subvectors of
$x$, and specify the space of functions as

(2.2)  $\mathcal{G} = \{\sum_{\ell=1}^{L} g_\ell(x_\ell) : E[g_\ell(x_\ell)^2] < \infty\}$.

For example, if $L = r$ and each $x_\ell$ is just a component of $x$, the set $\mathcal{G}$ consists of
additive functions.  The projection on $\mathcal{G}$ generalizes linear regression to allow for
nonparametric nonlinearities in individual regressors.  The set of equation (2.2) is a
further generalization that allows for nonlinear interactive terms.  For example, if each
$x_\ell$ is just one or two dimensional, then this set would allow for just pairwise
interactions.

Covariate Interactive Projections:  As discussed in Riedel (1992), problems in growth
curve estimation motivate considering projections that are random linear combinations of
functions.  To describe these, suppose $x = (w,u)$, where $w = (w_1, \ldots, w_L)'$ is a vector
of covariates and let $H_\ell$, $(\ell = 1, \ldots, L)$, be sets of functions of $u$.  Consider the
set of functions

(2.3)  $\mathcal{G} = \{\sum_{\ell=1}^{L} w_\ell h_\ell(u) : h_\ell \in H_\ell\}$,  $E[ww'|u]$ is nonsingular with probability one.

In a growth curve application $u$ represents time, so that each $h_\ell(u)$ represents a
covariate coefficient that is allowed to vary over time in a general way.
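
As a rough illustration of the form of the functions in (2.3), the following sketch builds regressors of the type "covariate times function of u" by multiplying each covariate with each term of a small basis in u. The function name `covariate_interactive_design`, the simulated data, and the quadratic basis in u are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def covariate_interactive_design(w, basis_u):
    """Return the (n, L*Kb) matrix of products w_l * b_k(u).

    w       : (n, L) array of covariates
    basis_u : (n, Kb) array of basis functions evaluated at u
    Column l*Kb + k holds w[:, l] * basis_u[:, k].
    """
    n, L = w.shape
    Kb = basis_u.shape[1]
    # Broadcast to (n, L, Kb), then flatten the last two axes.
    return (w[:, :, None] * basis_u[:, None, :]).reshape(n, L * Kb)

# Hypothetical data: u plays the role of time, w holds L = 2 covariates.
rng = np.random.default_rng(0)
n = 200
u = rng.uniform(0.0, 1.0, size=n)
w = np.column_stack([np.ones(n), rng.normal(size=n)])
basis_u = np.vander(u, 3, increasing=True)   # 1, u, u^2 (illustrative choice)

X = covariate_interactive_design(w, basis_u)

# Any coefficient vector gives a function of the form sum_l w_l * h_l(u),
# with each h_l a quadratic in u.
coef = np.array([1.0, 0.5, 0.0, 2.0, -1.0, 0.0])
g_values = X @ coef
```

Each block of three columns corresponds to one covariate, so a coefficient block can be read as the time-varying coefficient on that covariate.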


The estimators of $g_0(x)$ considered here are sample projections on a finite
dimensional subspace of $\mathcal{G}$, which can be described as follows.  Let $p^K(x) =
(p_{1K}(x), \ldots, p_{KK}(x))'$ be a vector of functions, each of which is an element of $\mathcal{G}$.
Denote the data observations by $y_i$ and $x_i$, $(i = 1, 2, \ldots)$, and let $y =
(y_1, \ldots, y_n)'$ and $p^K = [p^K(x_1), \ldots, p^K(x_n)]'$, for sample size $n$.  An estimator of $g_0(x)$
is

(2.4)  $\hat{g}(x) = p^K(x)'\hat{\pi}$,  $\hat{\pi} = (p^{K\prime}p^K)^{-}p^{K\prime}y$,

where $(\cdot)^{-}$ denotes a generalized inverse, and $K$ subscripts for $\hat{g}(x)$ and $\hat{\pi}$ have
been suppressed for notational convenience.  The matrix $p^{K\prime}p^K$ will be asymptotically
nonsingular under conditions given below, making the choice of generalized inverse
asymptotically irrelevant.
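
A minimal sketch of the estimator in (2.4) is given below, assuming the Moore-Penrose pseudo-inverse as the generalized inverse and a univariate cubic power series as the approximating functions. The function names `series_fit` and `series_predict`, the simulated data, and the tolerance passed to `pinv` are choices made for this illustration.

```python
import numpy as np

def series_fit(P, y):
    """Coefficients (P'P)^- P'y, with P the (n, K) matrix whose rows are p^K(x_i)'."""
    # The Moore-Penrose inverse is one particular generalized inverse.
    return np.linalg.pinv(P.T @ P, rcond=1e-10) @ (P.T @ y)

def series_predict(P_new, coef):
    """Evaluate the fitted series at the rows of P_new."""
    return P_new @ coef

# Hypothetical data for the example.
rng = np.random.default_rng(0)
n, K = 300, 4
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(2.0 * x) + 0.2 * rng.normal(size=n)

P = np.vander(x, K, increasing=True)          # 1, x, x^2, x^3
coef = series_fit(P, y)
fitted = series_predict(P, coef)

# With a redundant (duplicated) column P'P is singular, but the fitted
# values from the generalized-inverse formula are unchanged.
P_dup = np.column_stack([P, P[:, 0]])
fitted_dup = series_predict(P_dup, series_fit(P_dup, y))
```

When $p^{K\prime}p^K$ is nonsingular this coincides with ordinary least squares; the duplicated-column case shows why the particular generalized inverse does not matter for the fitted function.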

The idea of sample projection estimators is that they should approximate $g_0(x)$ if
$K$ is allowed to grow with the sample size.  The two key features of this approximation
are that 1) each component of $p^K(x)$ is an element of $\mathcal{G}$, and 2) $p^K(x)$ "spans" $\mathcal{G}$
as $K$ grows (i.e. for any function in $\mathcal{G}$, $K$ can be chosen big enough that there is a
linear combination of $p^K(x)$ that approximates it arbitrarily closely in mean square).
Under 1), $\hat{\pi}$ estimates

$\pi \equiv (E[p^K(x)p^K(x)'])^{-1}E[p^K(x)y] = (E[p^K(x)p^K(x)'])^{-1}E[p^K(x)g_0(x)]$,

the coefficients of the projection of $g_0(x)$ on $p^K(x)$.  Thus, under 1) and 2),
$p^K(x)'\pi$ will approximate $g_0(x)$.  Consequently, when the estimation error in $\hat{\pi}$
is small, $\hat{g}(x)$ should approximate $g_0(x)$.

Two  types  of  approximating  functions  will  be  considered  in  detail.     They  are  power 
series  and  splines. 


Power Series:  Let $\lambda = (\lambda_1, \ldots, \lambda_r)'$ denote an r-dimensional vector of nonnegative
integers, i.e. a multi-index, with norm $|\lambda| = \sum_{j=1}^{r} \lambda_j$, and let
$x^\lambda \equiv \prod_{j=1}^{r} x_j^{\lambda_j}$.  For a sequence $\{\lambda(k)\}_{k=1}^{\infty}$ of distinct such vectors,
a power series approximation corresponds to

(2.5)  $p_{kK}(x) = x^{\lambda(k)}$,  $(k = 1, 2, \ldots)$.

Throughout the paper it will be assumed that $\lambda(k)$ are ordered so that $|\lambda(k)|$ is
monotonically increasing.  For estimating the conditional expectation $E[y|x]$, it will
also be required that $\{\lambda(k)\}_{k=1}^{\infty}$ include all distinct multi-indices.  This requirement
is imposed so that $E[y|x]$ can be approximated by a power series.  Additive-interactive
projections can be estimated by restricting the multi-indices so that each term $p_{kK}(x)$
is an element of $\mathcal{G}$.  This can be accomplished by requiring that the only $\lambda(k)$ that are
included are those where indices of nonzero elements are the same as the indices of a
subvector $x_\ell$ for some $\ell$.  In addition, covariate interactive terms can be estimated by
taking the multi-indices to have the same dimension as $u$ and specifying the
approximating functions to be $p_{kK}(x) = w_{\ell(k)}u^{\lambda(k)}$, where $\ell(k)$ is an integer that
selects a component of $w$.
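
The ordering of multi-indices and the restriction used for additive projections can be sketched as follows. The helper names `multi_indices`, `power_series`, and `additive_only`, and the tie-breaking rule within a given norm, are assumptions of this example; the text only requires that the norm be monotone along the sequence.

```python
import itertools
import numpy as np

def multi_indices(r, max_degree):
    """All r-dimensional multi-indices with norm <= max_degree, ordered by norm."""
    idx = [lam for lam in itertools.product(range(max_degree + 1), repeat=r)
           if sum(lam) <= max_degree]
    # Sort by |lambda| first; ties are broken lexicographically (arbitrary choice).
    return sorted(idx, key=lambda lam: (sum(lam), lam))

def power_series(x, lams):
    """x is (n, r); column k of the result is prod_j x_j ** lams[k][j]."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.prod(x ** np.array(lam), axis=1) for lam in lams])

def additive_only(lams):
    """Keep multi-indices with at most one nonzero element (purely additive terms)."""
    return [lam for lam in lams if sum(1 for a in lam if a > 0) <= 1]

lams = multi_indices(r=2, max_degree=2)
additive_lams = additive_only(lams)
row = power_series(np.array([[2.0, 3.0]]), lams)
```

Here `additive_only` drops the cross term with multi-index (1, 1), which is the only term in this small example that depends on both components of x.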

Power series have a potential drawback of being sensitive to outliers.  It may be
possible to make them less sensitive by using power series in a bounded, one-to-one
transformation of the original data.  An example would be to replace each component $x_j$ of $x$
by a logit transformation $1/(1+e^{-x_j})$.

The theory to follow uses orthogonal polynomials.  If each $x^{\lambda(k)}$ is replaced with the
product of orthogonal polynomials of order corresponding to components of $\lambda(k)$, with
respect to some weight function on the range of $x$, and the distribution of $x_i$ is
similar to this weight, then there should be little collinearity among the different
replaced terms.  The estimator will be numerically invariant to such a replacement (because
$|\lambda(k)|$ is monotonically increasing), but it may alleviate the well known
multicollinearity problem for power series.
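
The invariance and the conditioning effect can be checked numerically. The sketch below, which assumes a single regressor that is uniform on [-1, 1] and uses Legendre polynomials as the orthogonal family, compares fitted values and condition numbers for the monomial and Legendre versions of the same polynomial regression; the data and the choice K = 8 are illustrative.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n, K = 500, 8
x = rng.uniform(-1.0, 1.0, size=n)
y = np.exp(x) + 0.1 * rng.normal(size=n)

P_mono = np.vander(x, K, increasing=True)   # 1, x, ..., x^(K-1)
P_leg = legendre.legvander(x, K - 1)        # Legendre polynomials of degree 0..K-1

# Both matrices span the same space of polynomials, so the least squares
# fitted values agree up to rounding error.
fit_mono = P_mono @ np.linalg.lstsq(P_mono, y, rcond=None)[0]
fit_leg = P_leg @ np.linalg.lstsq(P_leg, y, rcond=None)[0]

# The Legendre design is much better conditioned for this distribution of x.
cond_mono = np.linalg.cond(P_mono)
cond_leg = np.linalg.cond(P_leg)
```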

Regression Splines:  A regression spline is a series estimator where the approximating
function is a smooth piecewise polynomial with fixed knots (join points).  They have
some attractive features relative to power series, including being less sensitive to
singularities in the function being approximated and less oscillatory.  A disadvantage
is that the theory requires that knots be placed in the support and be nonrandom (as in
Stone, 1985), so that the support must be known.  The power series theory does not
require a known support.

To describe regression splines it is convenient to begin with the one-dimensional $x$
case.  For convenience, suppose that the support of $x$ is $[-1,1]$ (it can always be
normalized to take this form) and that the knots are evenly spaced.  Let $(\cdot)_+ = 1(\cdot >
0)(\cdot)$.  An $m$-th degree spline with $L+1$ evenly spaced knots on $[-1,1]$ is a linear
combination of

(2.6)  $P_{kL}(v) = v^k$ for $0 \leq k \leq m$,  and  $P_{kL}(v) = \{[v + 1 - 2(k-m)/(L+1)]_+\}^m$ for $m+1 \leq k \leq m+L$.

Multivariate spline terms can be formed by interacting univariate ones for different
components of $x$.  For a set of multi-indices $\{\lambda(k)\}$, with $\lambda_j(k) \leq m+L-1$ for each $j$
and $k$, the approximating functions will be products of univariate splines, i.e.

(2.7)  $p_{kK}(x) = \prod_{j=1}^{r} P_{\lambda_j(k),L}(x_j)$,  $(k = 1, \ldots, K)$.

Note that corresponding to each $K$ there is a number of knots for each component of $x$
and a choice of which multiplicative components to include.  Throughout the paper it will
be assumed that each ratio of numbers of knots for a pair of elements of $x$ is bounded
above and below.  For estimating the conditional expectation $E[y|x]$, it will also be
required that $\{\lambda(k)\}_{k=1}^{\infty}$ include all distinct multi-indices.  This requirement is
imposed so that $E[y|x]$ can be approximated by interactive splines.
Additive-interactive projections can be estimated by restricting the multi-indices in the
same way as for power series.  Also, covariate interactive terms can be estimated by
forming the approximating functions as products of elements of $w$ with splines in $u$,
analogously to the power series case.

The  theory  to  follow  uses  B-splines,   which  are  a  linear  transformation  of  the  above 
basis  that  is  nonsingular  on     [-1,1]     and  has  low  multicollinearity.  The  low 
multicollinearity  of  B-splines  and  recursive  formula  for  calculation  also  lead  to 
computational  advantages;  e.g.  see  Powell  (1981). 

Series estimates depend on the choice of the number of terms $K$, so that it is
desirable to choose $K$ based on the data.  With a data-based choice of $K$, these
estimates have the flexibility to adjust to conditions in the data.  For example, one
might choose $K$ by delete-one cross validation, by minimizing the sum of squared
residuals $\sum_{i=1}^{n}[y_i - \hat{g}_{-iK}(x_i)]^2$, where $\hat{g}_{-iK}(x_i)$ is the estimate of the regression
function computed from all the observations but the $i$-th.  Some of the results to follow
will allow for data based $K$.
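
Delete-one cross validation for a least squares series estimator does not require refitting n times, because the delete-one residual equals the full-sample residual divided by one minus the corresponding diagonal element of the projection matrix. The sketch below uses that standard identity; the function names, the univariate power series, and the simulated data are assumptions of the example, and the code presumes the regressor matrix has full column rank.

```python
import numpy as np

def delete_one_cv(P, y):
    """Sum over i of [y_i - ghat_{-i}(x_i)]^2 for least squares on the (n, K) matrix P."""
    Q, _ = np.linalg.qr(P)              # thin QR; columns of Q span the columns of P
    h = np.sum(Q ** 2, axis=1)          # diagonal of the projection (hat) matrix
    resid = y - Q @ (Q.T @ y)           # full-sample residuals
    return np.sum((resid / (1.0 - h)) ** 2)

def choose_K(x, y, K_grid):
    """Return the K in K_grid minimizing the delete-one criterion, and all scores."""
    scores = {K: delete_one_cv(np.vander(x, K, increasing=True), y) for K in K_grid}
    return min(scores, key=scores.get), scores

# Hypothetical data for the example.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x) + 0.3 * rng.normal(size=200)
K_hat, cv_scores = choose_K(x, y, K_grid=range(1, 11))
```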


3.        General Convergence Rates


This section derives some convergence rates for general series estimators.  To do
this it is useful to introduce some conditions.  Let $u = y - g_0(x)$, $u_i = y_i - g_0(x_i)$.
Also, for a matrix $D$ let $\|D\| = [\mathrm{trace}(D'D)]^{1/2}$, for a random matrix $Y$,
$\|Y\|_v = \{E[\|Y\|^v]\}^{1/v}$, $v < \infty$, and $\|Y\|_\infty$ the infimum of constants $C$ such that
$\mathrm{Prob}(\|Y\| < C) = 1$.

Assumption 3.1:  $\{(y_i, x_i)\}$ is i.i.d. and $E[u^2|x]$ is bounded on the support of $x_i$.


The  bounded  second  conditional  moment  assumption  is  quite  common  in  the  literature  (e.g. 
Stone,  1985).     Apparently  it  can  be  relaxed  only  at  the  expense  of  affecting  the 
convergence  rates,  so  to  avoid  further  complication  this  assumption  is  retained. 


The  next  Assumption  is  useful  for  controlling  the  second  moment  matrix  of  the  series 
terms. 

Assumption 3.2:  For each $K$ there is a constant, nonsingular matrix $A$ such that for
$P^K(x) = Ap^K(x)$, the smallest eigenvalue of $E[P^K(x)P^K(x)']$ is bounded away from zero
uniformly in $K$.


Since the estimator $\hat{g}(x)$ is invariant to nonsingular linear transformations, there is
really no need to distinguish between $p^K(x)$ and $P^K(x)$ at this point.  An explicit
transformation $A$ is allowed for in order to emphasize that Assumption 3.2 is only
needed for some transformation.  For example, Assumption 3.2 will not apply to power
series, but will apply to orthonormal polynomials.

Assumption 3.2 is a normalization that leads to the series terms having specific
magnitudes.  The regularity conditions will also require that the magnitude of $P^K(x)$
not grow too fast with the sample size.  The size of $P^K(x)$ will be quantified by

(3.1)  $\zeta_d(K) = \sup_{|\lambda| = d,\, x \in \mathcal{X}} \|\partial^\lambda P^K(x)\|$,

where $\mathcal{X}$ is the support of $x$, $\|D\| = [\mathrm{trace}(D'D)]^{1/2}$ for a matrix $D$, $\lambda$ denotes a
vector of nonnegative integers, and

$|\lambda| = \sum_{j=1}^{r} \lambda_j$,  $\partial^\lambda P^K(x) = \partial^{|\lambda|}P^K(x)/\partial x_1^{\lambda_1} \cdots \partial x_r^{\lambda_r}$.

That is, $\zeta_d(K)$ is the supremum of the norms of derivatives of order $d$.
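
To give a feel for the magnitude in (3.1) with d = 0, the following sketch evaluates it on a grid for one specific case: a scalar x that is uniform on [-1, 1], with Legendre polynomials rescaled to have unit second moment under that distribution. The uniform distribution, the grid, and the function name `zeta0_legendre` are assumptions of this example. In this case the supremum is attained at the endpoints and equals K.

```python
import numpy as np
from numpy.polynomial import legendre

def zeta0_legendre(K, grid_size=2001):
    """Grid approximation to the sup over [-1, 1] of the Euclidean norm of P^K(x)."""
    x = np.linspace(-1.0, 1.0, grid_size)      # grid includes both endpoints
    # Under the uniform distribution on [-1, 1], E[L_k(x)^2] = 1/(2k+1), so
    # sqrt(2k+1) * L_k(x) has unit second moment and E[P^K P^K'] is the identity.
    scale = np.sqrt(2.0 * np.arange(K) + 1.0)
    P = legendre.legvander(x, K - 1) * scale
    return np.max(np.linalg.norm(P, axis=1))

values = {K: zeta0_legendre(K) for K in (1, 5, 10, 20)}
```

Linear growth of this magnitude in K is what turns a condition of the form "fourth power of the magnitude over n goes to zero" into a restriction on the fourth power of K relative to n.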

The following condition places some limits on the growth of the series magnitude.
Also, it allows for data based choice of $K$, at the expense of imposing that series
terms are nested.

Assumption 3.3:  There are $\underline{K}(n)$ and $\bar{K}(n)$ such that $\underline{K}(n) \leq K \leq \bar{K}(n)$ with probability
approaching one and either a) $p^K(x)$ is a subvector of $p^{K+1}(x)$ for all $K$ with
$\underline{K}(n) \leq K < K+1 \leq \bar{K}(n)$ and $\sum_{K=\underline{K}(n)}^{\bar{K}(n)} \zeta_0(K)^4/n \to 0$, or b) the $P^K(x)$ of Assumption 3.2
is a subvector of $P^{K+1}(x)$ for all $K$ with $\underline{K}(n) \leq K < K+1 \leq \bar{K}(n)$ and
$\zeta_0(\bar{K}(n))^4/n \to 0$.


As previously noted, a series estimate is invariant to nonsingular linear transformations
of $p^K(x)$, so that in part a) it suffices that any such transformation form a nested
sequence of vectors.  Part b) is more restrictive, in requiring that the $\{P^K(x)\}$ from
Assumption 3.2 be nested, but imposes a less stringent requirement on the growth rate of
$K$.  Also, if $K$ is nonrandom, so that $\underline{K}(n) = K = \bar{K}(n)$, the nested sequence requirement
of both part a) and b) will be satisfied, because that requirement is vacuous when
$\underline{K} = \bar{K}$.

In order to specify primitive hypotheses for Assumptions 3.2 and 3.3 it must be
possible to find $P^K(x)$ satisfying the eigenvalue condition, and having known values
for, or bounds on, $\zeta_0(K)$.  That is, one needs explicit bounds on series terms where the
eigenvalues are bounded away from zero.  It is possible to derive such bounds for both
power series and regression splines, when $x$ is continuously distributed with a density
that is bounded away from zero.  These bounds lead to the requirements that $\bar{K}^4/n \to 0$
for power series and $K^2/n \to 0$ for regression splines with nonrandom $K$.  These results
are described in Sections 5 and 6.  It is also possible to derive such results for
Fourier series, but this is not done here because they are most suitable for
approximation of periodic functions, which have fewer applications.  It may also be
possible to derive results for Gallant's (1981) Fourier flexible form, although this is
more difficult, as described in Gallant and Souza (1991).  In terms of this paper, the
problem with the Fourier flexible form is that the linear and quadratic terms can be
approximated extremely quickly by the Fourier terms, leading to a multicollinearity
problem so severe that simultaneous satisfaction of Assumptions 3.2 and 3.3 would impose
very slow growth rates on $K$.

Assumptions 3.1 - 3.3 are useful for controlling the variance of a series estimator.
The bias is the error from the finite dimensional approximation.  A supremum Sobolev
norm will be used to quantify this approximation.  For a measurable function $f(x)$
defined on $\mathcal{X}$ and a nonnegative integer $d$, let

$|f|_d = \max_{|\lambda| \leq d} \sup_{x \in \mathcal{X}} |\partial^\lambda f(x)|$,

and $|f|_d$ equal to infinity if $\partial^\lambda f(x)$ does not exist for some $|\lambda| \leq d$ and $x \in \mathcal{X}$.
Many of the results will be based on the following polynomial approximation rate
condition.

Assumption 3.4:  There is a nonnegative integer $d$ and constants $C$, $\alpha > 0$ such that
for all $K$ there is $\pi$ with $|g_0 - p^{K\prime}\pi|_d \leq CK^{-\alpha}$.

This  condition  is  not  primitive,  but  is  known  to  be  satisfied  in  many  cases.     Typically, 
the higher the degree of derivative of $g_0(x)$ that exists, the bigger $\alpha$ and/or $d$ can
be chosen.  This type of primitive condition will be explicitly discussed for power series
in  Section  5  and  for  splines  in  Section  6.     It  is  also  possible  to  obtain  results  when  the 
approximation  rate  is  for  an     L       norm,  rather  than  the  sup  norm.     However,  this 
generalization  leads  to  much  more  complicated  results,  and  so  is  not  given  here. 

These  assumptions  will  imply  both  mean  square,  and  uniform  convergence  rates  for  the 
series  estimate.     The  first  result  gives  mean-square  rates.     Let     F(x)     denote  the  CDF  of 
x. 

Theorem 3.1:  If Assumptions 3.1 - 3.4 are satisfied for $d = 0$ then

$\sum_{i=1}^{n}[\hat{g}(x_i) - g_0(x_i)]^2/n = O_p(K/n + K^{-2\alpha})$,

$\int[\hat{g}(x) - g_0(x)]^2 dF(x) = O_p(K/n + K^{-2\alpha})$.

The two terms in the convergence rate essentially correspond to variance and bias.  The
first  conclusion,  on  sample  mean  square  error,   is  similar  to  those  of  Andrews  and  Whang 
(1991)  and  Newey  (1993b),  but  the  hypotheses  are  different.     Here  the  number  of  terms     K 
is  allowed  to  depend  on  the  data,  and  the  projection  residual     u     need  not  satisfy 
E[u|x]  =  0,     at  the  expense  of  requiring  Assumptions  3.2  and  3.3,  that  were  not  imposed 
in  these  other  papers.     Also,  the  second  conclusion,  on  integrated  mean  square  error,  has 
not  been  previously  given  at  this  level  of  generality,  although  Stone  (1985)  gave 
specific  results  for  spline  estimation  of  an  additive  projection. 
The  next  result  gives  uniform  convergence  rates. 

Theorem 3.2:  If Assumptions 3.1, 3.2, 3.3 b), and 3.4 are satisfied for a nonnegative
integer $d$ then

$|\hat{g} - g_0|_d = O_p(\zeta_d(K)[(K/n)^{1/2} + K^{-\alpha}])$.

There  does  not  seem  to  be  in  the  literature  any  previous  uniform  convergence  results  that 
cover  derivatives  and  general  series  in  the  way  this  one  does.     Furthermore,  for  the 
univariate  power  series  case,  the  convergence  rate  that  is  implied  by  this  result 
improves  on  that  of  Cox  (1988),  as  further  discussed  in  Section  4.     These  uniform  rates 
do  not  attain  Stone's  (1982)  bounds,  although  they  do  appear  to  improve  on  previously 
known  rates. 

For specific classes of functions $\mathcal{G}$ and series approximations, more primitive
conditions  for  Assumptions  3.2  -  3.4  can  be  specified  in  order  to  derive  convergence 
rates  for  the  estimators.     These  results  are  illustrated  in  the  next  two  Sections,  where 
convergence  rates  are  derived  for  power  series  and  regression  spline  estimators  of 
additive  interactive  and  covariate  interactive  functions. 

4.        Additive Interactive Projections

This  Section  gives  convergence  rates  for  power  series  and  regression  spline 
estimators  of  additive  interactive  functions.     The  first  regularity  condition 
restricts     x     to  be  continuously  distributed. 

Assumption  4.1:     x     is  continuously  distributed  with  a  support  that  is  a  cartesian 
product  of  compact  intervals,  and  bounded  density  that  is  also  bounded  away  from  zero. 

This  assumption  is  useful  for  showing  that  the  set  of  additive-interactive  functions  is 
closed.     Also,  this  condition  leads  to  Assumptions  3.2  and  3.3  being  satisfied  with 
explicit formulae for $\zeta_0(K)$.  For power series it is possible to generalize this
condition,  so  that  the  density  goes  to  zero  on  the  boundary  of  the  support.     For 
simplicity  this  generalization  is  not  given  here,  although  the  Lemmas  given  in  the 
appendix  can  be  used  to  verify  the  Section  3  conditions  in  this  case. 

It  is  also  possible  to  allow  for  a  discrete  regressor  with  finite  support,  by 
including  all  dummy  variables  for  all  points  of  support  of  the  regressor,  and  all 
interactions.     Because  such  a  regressor  is  essentially  parametric,  and  allowing  for  it 
does  not  change  any  of  the  convergence  rate  results,  this  generalization  will  not  be 
considered  here. 

Under  Assumption  4.1  the  following  condition  will  suffice  for  Assumptions  3.2  and 
3.3. 

Assumption 4.2:  Either a) $p_{kK}(x)$ is a power series with $\bar{K}^4/n \to 0$, or b) $p_{kK}(x)$ are
splines, the support of $x$ is $[-1,1]^r$, $\underline{K}(n) = \bar{K}(n) = K$, and $K^2/n \to 0$.

It  is  possible  to  allow  for  data  based     K     for  splines  and  obtain  similar  mean-square 
convergence  rates  to  those  given  below.     This  generalization  is  not  given  here  because  it 
would  further  complicate  the  statement  of  results. 

A primitive condition for Assumption 3.4 is the following one.

Assumption 4.3:  Each of the components $g_\ell(x_\ell)$, $(\ell = 1, \ldots, L)$, is continuously
differentiable of order $s$ on the support of $x_\ell$.

Let $q$ denote the maximum dimension of the components of the additive interactive
function.  This condition can be combined with known results on approximation rates for
power series and splines to show that Assumption 3.4 is satisfied for $d = 0$ and $\alpha = s/q$,
and with $\alpha = s - d$ when $q = 1$.  The details are given in the appendix.

These conditions lead to the following result on mean-square convergence.

Theorem 4.1:  If Assumptions 3.1 and 4.1 - 4.3 are satisfied, then

$\sum_{i=1}^{n}[\hat{g}(x_i) - g_0(x_i)]^2/n = O_p(K/n + K^{-2s/q})$,  $\int[\hat{g}(x) - g_0(x)]^2 dF(x) = O_p(K/n + K^{-2s/q})$.

The integrated mean square error result for splines that is given here has previously
been derived by Stone (1990).  The rest of this result is new, although Andrews and Whang
(1990) give the same conclusion for the sample mean square error of power series under
different hypotheses.  An implication of Theorem 4.1 is that power series will have an
optimal integrated mean-square convergence rate if the number of terms is chosen randomly
between certain bounds.  If there are $C \geq c > 0$ such that $\underline{K} = cn^\gamma$, $\bar{K} = Cn^\gamma$, where
$\gamma = q/(2s+q)$, and $s > 3q/2$, then the mean-square convergence rate is $n^{-2s/(2s+q)}$, which
attains Stone's (1982) bound.  The side condition that $s > 3q/2$ is needed to ensure
$\bar{K} = Cn^\gamma$ satisfies Assumption 4.2.  A similar side condition is present for the
spline version of Stone (1990), but it has the less stringent form of $s > q/2$.
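
The arithmetic behind the growth rate and the side conditions can be sketched as follows. The labels s (order of differentiability) and q (maximum dimension of the additive components) and the function names are chosen for this example. With K proportional to n raised to the power gamma = q/(2s+q), the variance term K/n and the squared bias term K^(-2s/q) are of the same order, n^(-2s/(2s+q)); the power series requirement that K^4/n vanish then amounts to 4*gamma < 1, and the spline requirement that K^2/n vanish amounts to 2*gamma < 1.

```python
def optimal_growth(s, q):
    """Return (gamma, rate): K ~ n**gamma balances K/n against K**(-2*s/q),
    giving a mean-square error of order n**(-rate)."""
    gamma = q / (2.0 * s + q)
    rate = 2.0 * s / (2.0 * s + q)
    return gamma, rate

def power_series_side_condition(s, q):
    """K**4 / n -> 0 with K ~ n**gamma requires 4*gamma < 1, i.e. s > 3q/2."""
    return s > 1.5 * q

def spline_side_condition(s, q):
    """K**2 / n -> 0 with K ~ n**gamma requires 2*gamma < 1, i.e. s > q/2."""
    return s > 0.5 * q

# Example: twice differentiable univariate components.
gamma, rate = optimal_growth(s=2, q=1)
```

For s = 2 and q = 1 this gives gamma = 1/5 and a mean-square rate of n^(-4/5), and both side conditions hold; for s = 1 and q = 1 only the spline condition holds.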

Theorem 3.2 can be specialized to obtain uniform convergence rates for power series
and spline estimators.  Here $s$ denotes the order of differentiability in Assumption 4.3
and $q$ the maximum dimension of the components of the additive interactive function.

Theorem 4.2:  If Assumptions 3.1 and 4.1 - 4.3 are satisfied, then for power series

$|\hat{g} - g_0|_0 = O_p(K[(K/n)^{1/2} + K^{-s/q}])$,

and for regression splines,

$|\hat{g} - g_0|_0 = O_p(K^{1/2}[(K/n)^{1/2} + K^{-s/q}])$.

Obtaining uniform convergence rates for derivatives is more difficult, because
approximation rates are difficult to find in the literature.  When the argument of each
function is only one dimensional, an approximation rate follows by a simple integration
argument (e.g. see Lemma A.12 in the Appendix).  This approach leads to the following
convergence rate for the one-dimensional (i.e. additive model) case.

Theorem 4.3:  If Assumptions 3.1 and 4.1 - 4.3 are satisfied, $q = 1$, $d < s$, $p^K(x)$ is a
power series or a regression spline with $m \geq d,\ s - d$, then for power series,

$|\hat{g} - g_0|_d = O_p(K^{1+2d}\{[K/n]^{1/2} + K^{-s+d}\})$,

and for splines

$|\hat{g} - g_0|_d = O_p(K^{1/2+d}\{[K/n]^{1/2} + K^{-s+d}\})$.

In the case of power series, it is possible to obtain an approximation rate by a
Taylor expansion argument when the derivatives do not grow too fast with their order.
The rate is faster than any power of $K$, leading to the following result.

Theorem 4.4:  If Assumptions 3.1 and 4.1 - 4.3 are satisfied, $p^K(x)$ is a power series,
and there is a constant $C$ such that for each multi-index $\lambda$, the $\lambda$-th partial
derivative of each additive component of $g_0(x)$ exists and is bounded by $C^{|\lambda|}$, then for
any positive integers $\alpha$ and $d$,

$|\hat{g} - g_0|_d = O_p(K^{1+2d}\{[K/n]^{1/2} + K^{-\alpha}\})$.

The  uniform  convergence  rates  are  not  optimal  in  the  sense  of  Stone  (1982),  but  they 
improve  on  existing  results.     For  the  one  regressor,  power  series  case  Theorem  4.2 
improves  on  Cox's  (1988)  rate  of     0  (K  <[K/n]        +  K~A>).     For  the  other  cases  there  do 
not  seem  to  be  any  existing  results  in  the  literature,  so  that  Theorems  4.2  -  4.4  give 
the  only  uniform  convergence  rates  available.     It  would  be  interesting  to  obtain  further 
improvements  on  these  results,  and  investigate  the  possibility  of  attaining  optimal 
uniform  convergence  rates  for  series  estimators  of  additive  interactive  models. 


5.        Covariate  Interactive  Projections. 

Estimation of random coefficient projections provides a second example of how the
general results of Section 3 can be applied to specific estimators.  This Section gives
convergence rates for power series and regression spline estimators of projections on the
set $\mathcal{G}$ described in equation (2.3).  For simplicity, results will be restricted to
mean-square and uniform convergence rates for the function, but not for its derivatives.
Also, the $H_\ell$ in equation (2.3) will each be taken equal to the set of all functions of
$u$ with finite mean-square.

Convergence rates can be derived under the following analog to the conditions of
Section 4.

Assumption 5.1: i) u is continuously distributed with a support that is a Cartesian
product of compact intervals, and bounded density that is also bounded away from zero.
ii) K is restricted to be a multiple of L and $p^K(x) = w \otimes p^{K/L}(u)$, where either
a) $p^{K/L}(u)$ is a power series with $K^4/n \to 0$, or b) $p^{K/L}(u)$ are splines, the support of u
is $[-1,1]^r$, $\underline{K}(n) = \bar{K}(n) = K$, and $K^2/n \to 0$. iii) Each of the components $h_\ell(u)$,
($\ell = 1, ..., L$), is continuously differentiable of order s on the support of u. iv) w
is bounded, and $E[ww'|u]$ has smallest eigenvalue that is bounded away from zero on the
support of u.



These conditions lead to the following result on mean-square convergence.

Theorem 5.1: If Assumptions 3.1 and 5.1 are satisfied, then

\[ \sum_{i=1}^{n} [\hat{g}(x_i) - g_0(x_i)]^2/n = O_p(K/n + K^{-2s/r}), \qquad \int [\hat{g}(x) - g_0(x)]^2 dF(x) = O_p(K/n + K^{-2s/r}). \]

Also, for power series and splines respectively,
\[ |\hat{g} - g_0|_0 = O_p(K[(K/n)^{1/2} + K^{-s/r}]), \]
\[ |\hat{g} - g_0|_0 = O_p(K^{1/2}[(K/n)^{1/2} + K^{-s/r}]). \]
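
To make the estimator in this result concrete, the following sketch regresses y on the Kronecker product of w with a power series in a scalar u, then reads off the fitted coefficient functions. It is a minimal illustration under stated assumptions: u is scalar (r = 1), the basis is the raw monomials, and all function names, the simulated design, and the choice of six terms per coefficient function are hypothetical rather than taken from the text.

```python
import numpy as np


def power_basis(u, k):
    """n x k matrix with columns 1, u, ..., u^(k-1) (scalar u, so r = 1)."""
    return np.vander(np.asarray(u, dtype=float), N=k, increasing=True)


def interactive_design(w, u, k):
    """Row i is the Kronecker product of w_i with p^k(u_i), of length L * k."""
    pu = power_basis(u, k)
    return np.einsum("il,ik->ilk", w, pu).reshape(len(pu), -1)


def fit_coefficient_functions(y, w, u, k):
    """Least squares on the interactive design; row l holds the series coefficients of h_l."""
    design = interactive_design(w, u, k)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef.reshape(w.shape[1], k)


def evaluate(coef, u):
    """Fitted h_1(u), ..., h_L(u) as an array of shape (len(u), L)."""
    return power_basis(u, coef.shape[1]) @ coef.T


# Simulated example (illustrative data-generating process, not from the text).
rng = np.random.default_rng(0)
n = 2000
u = rng.uniform(-1.0, 1.0, size=n)
w = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=n)])  # bounded w
h = np.column_stack([np.sin(2 * u), np.exp(u)])                    # true h_1, h_2
y = np.sum(w * h, axis=1) + 0.1 * rng.standard_normal(n)
coef = fit_coefficient_functions(y, w, u, k=6)
print(np.max(np.abs(evaluate(coef, u) - h)))  # sup-norm error over the sample points
```

The total number of regressors is the number of coefficient functions times the number of series terms in u, which is why K is restricted to be a multiple of the dimension of w.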

An important feature of this result is that the convergence rate does not depend on L,
but is controlled by the dimension of the coefficient functions and their degree of
smoothness. This feature is to be expected, since the nonparametric part of the
projection is the coefficient functions.

Appendix: Proofs of Theorems


Throughout, let C be a generic positive constant and $\lambda_{\min}(B)$ and $\lambda_{\max}(B)$ be
minimum and maximum eigenvalues of a symmetric matrix B. A number of lemmas will be
useful in proving the results. First some lemmas on mean-square closure of certain
spaces of functions are given.

Lemma A.1: If $\mathcal{H}$ is linear and closed and $E[\|w\|^2] < \infty$ then $\{w'\alpha + h(x) : h \in \mathcal{H}\}$ is
closed.

Proof: Let $u = w - P(w|\mathcal{H})$, so that $w'\alpha + h(x) = u'\alpha + h(x) + P(w|\mathcal{H})'\alpha$. Therefore, it
suffices to assume that w is orthogonal to $\mathcal{H}$. It is well known that finite
dimensional spaces are closed, and that direct sums of closed orthogonal subspaces are
closed, giving the conclusion. QED.

Lemma A.2: Consider sets $\mathcal{H}_j$, (j = 1, ..., J), of functions of a random vector x. If
each $\mathcal{H}_j$ is closed and w is a J x 1 random vector such that $\Omega(x) = E[ww'|x]$ is
bounded and has smallest eigenvalue bounded away from zero, then $\{\sum_{j=1}^{J} w_j h_j(x) : h_j \in \mathcal{H}_j\}$
is closed.

Proof: By iterated expectations, $E[\{w'h(x)\}^2] = E[h(x)'\Omega(x)h(x)] \ge C E[h(x)'h(x)]$

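Lemma A.3: Suppose that i) for each $x_\ell$, ($\ell = 1, ..., L$), if $\tilde{x}$ is a subvector of $x_\ell$
then $\tilde{x} = x_{\ell'}$ for some $\ell'$, and ii) there exists a constant c > 1 such that for each
$\ell$, with the partitioning $x = (x_\ell', x_\ell^{c\prime})'$, for any $a(x) > 0$,

\[ c \int a(x) d[F(x_\ell) \cdot F(x_\ell^c)] \ge E[a(x)] \ge c^{-1} \int a(x) d[F(x_\ell) \cdot F(x_\ell^c)]. \]

Then $\{\sum_{\ell=1}^{L} h_\ell(x_\ell) : E[h_\ell(x_\ell)^2] < \infty, \ell = 1, ..., L\}$
is closed in mean-square.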

Proof: Let $\mathcal{H} = \{\sum_{\ell=1}^{L} h_\ell(x_\ell)\}$ and $\|a\|_2 = [\int a(x)^2 dF(x)]^{1/2}$. By Proposition 2 of Section
4 of the Appendix of Bickel, Klaassen, Ritov, and Wellner (1993), $\mathcal{H}$ is closed if and
only if there is a constant C such that for each $h \in \mathcal{H}$, $\|h\|_2 \ge C \max_\ell \{\|h_\ell\|_2\}$ for some
$h_\ell$ (note $h_\ell$ need not be unique). Following Stone (1990, Lemma 1), suppose that the
maximal dimension of $x_\ell$ is r, and suppose that this property holds whenever the
maximal dimension of the $x_\ell$ is r-1 or less. Then there is a unique decomposition
$h = \sum_\ell h_\ell(x_\ell)$, such that for all $x_{\ell'}$ that are strict subvectors of $x_\ell$,
$E[h_\ell(x_\ell)\delta(x_{\ell'})] = 0$ for all measurable functions $\delta$ of $x_{\ell'}$ with finite mean-square.

Consequently, it suffices to show that for any "maximal" $x_k$, that is not a proper
subvector of any other $x_\ell$, there is a constant c > 1 such that
$E[h(x)^2] \ge c^{-1} E[h_k(x_k)^2]$. To show this property, note that holding fixed the vector of
components $x_k^c$ of x that are not components of $x_k$, each $h_\ell(x_\ell)$, $\ell \ne k$, is a
function of a strict subvector of $x_k$. Then,

\[ E[h(x)^2] \ge c^{-1} \int \{h_k(x_k) + \sum_{\ell \ne k} h_\ell(x_\ell)\}^2 dF(x_k) dF(x_k^c) \]
\[ = c^{-1} \int [\int \{h_k(x_k) + \sum_{\ell \ne k} h_\ell(x_\ell)\}^2 dF(x_k)] dF(x_k^c) \]
\[ = c^{-1} \int [\int \{h_k(x_k)^2 + \{\sum_{\ell \ne k} h_\ell(x_\ell)\}^2\} dF(x_k)] dF(x_k^c) \]
\[ \ge c^{-1} \int [\int h_k(x_k)^2 dF(x_k)] dF(x_k^c) = c^{-1} E[h_k(x_k)^2]. \quad \text{QED.} \]


The next few Lemmas consist of useful convergence results for random matrices with
dimension that can depend on sample size. Let $\hat{Z}$ and Z denote symmetric
matrices, and $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ the smallest and largest eigenvalues respectively.

Lemma A.4: If $\lambda_{\min}(Z) \ge C$ with probability approaching one (w.p.a.1) and $\|\hat{Z} - Z\| = o_p(1)$
then $\lambda_{\min}(\hat{Z}) \ge C$ w.p.a.1.

Proof: For a conformable vector $\mu$, it follows by $\|\cdot\|$ a matrix norm that

\[ \lambda_{\min}(\hat{Z}) = \min_{\|\mu\|=1} \{\mu' Z \mu + \mu'(\hat{Z} - Z)\mu\} \ge \lambda_{\min}(Z) - \lambda_{\max}(Z - \hat{Z}) \ge \lambda_{\min}(Z) - \|\hat{Z} - Z\| \ge C - o_p(1). \]

Therefore, $\lambda_{\min}(\hat{Z}) \ge C/2$ w.p.a.1. QED
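
The eigenvalue perturbation step in this proof can be checked numerically. The sketch below is an illustration only: it takes x uniform on [-1, 1] and a Legendre basis scaled so that the population second-moment matrix is the identity (both assumptions made here, not in the text), and reports how the sample second-moment matrix and its smallest eigenvalue behave as n grows. The helper names are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def min_eig(a):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return np.linalg.eigvalsh(a)[0]


def eigenvalue_gap(z, z_hat):
    """lambda_min(z_hat) - [lambda_min(z) - ||z_hat - z||], with the Frobenius norm.

    The perturbation inequality says this quantity is nonnegative.
    """
    return min_eig(z_hat) - (min_eig(z) - np.linalg.norm(z_hat - z))


def orthonormal_legendre(x, K):
    """Legendre polynomials scaled so that E[P(x)P(x)'] = I for x uniform on [-1, 1]."""
    scale = np.sqrt(2 * np.arange(K) + 1)
    return legendre.legvander(x, K - 1) * scale


rng = np.random.default_rng(0)
K = 5
for n in (100, 1000, 10000):
    P = orthonormal_legendre(rng.uniform(-1, 1, n), K)
    z_hat = P.T @ P / n
    print(n, np.linalg.norm(z_hat - np.eye(K)), min_eig(z_hat))
```

The Frobenius norm dominates the spectral norm, so the inequality holds for either choice of matrix norm.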


Lemma A.5: If $\lambda_{\min}(Z) \ge C$ w.p.a.1, $\|\hat{Z} - Z\| = o_p(1)$, and $D_n$ is a conformable matrix
such that $\|Z^{-1/2} D_n\| = O_p(\epsilon_n)$ for some $\epsilon_n$, then $\|\hat{Z}^{-1/2} D_n\| = O_p(\epsilon_n)$.

Proof: It is easy to show that for any conformable matrices A and B, $\|AB\| \le \|A\| \cdot \|B\|$,
$\|A'BA\| \le \|B\| \cdot \|A'A\|$, and that if B is positive semi-definite, $tr(A'BA) \le \|A\|^2 \lambda_{\max}(B)$,
$\|AB\| \le \|A\| \lambda_{\max}(B)$ and $\|BA\| \le \|A\| \lambda_{\max}(B)$. Let $Z^{-1/2}$ be the symmetric square root of
$Z^{-1}$, which is equal to $U \Lambda U'$ where U is an orthogonal matrix and $\Lambda$ a diagonal matrix
consisting of the square roots of the eigenvalues of $Z^{-1}$. Note that $Z^{-1/2}$ is positive
definite and $\lambda_{\max}(Z^{-1/2}) = [\lambda_{\max}(Z^{-1})]^{1/2}$. Also by Lemma A.4, $\lambda_{\max}(\hat{Z}^{-1}) = O_p(1)$. Then

(A.1)
\[ \|\hat{Z}^{-1/2} D_n\|^2 = tr(D_n'[Z^{-1} + \hat{Z}^{-1} - Z^{-1}] D_n) \]
\[ \le \|Z^{-1/2} D_n\|^2 (1 + \|Z^{-1/2}[Z - \hat{Z}]Z^{-1/2}\|) + \|(Z - \hat{Z}) Z^{-1} D_n\|^2 \lambda_{\max}(\hat{Z}^{-1}) \]
\[ \le O_p(\epsilon_n^2)[1 + o_p(1) O_p(1) + \|Z - \hat{Z}\|^2 \lambda_{\max}(Z^{-1/2})^2 O_p(1)] = O_p(\epsilon_n^2). \quad \text{QED} \]

Let     tr(A)     denote  the  trace  of  a  square  matrix     A     and     u     a  random  matrix  with     n 
rows. 


Lemma A.6: Suppose $\lambda_{\min}(Z) \ge C$, P is an n x K random matrix such that $\|P'P/n - Z\| = o_p(1)$
and $\|Z^{-1/2} P'u/n\| = O_p(\epsilon_n)$, and p = PA where A is a random matrix. Then

\[ tr(u'p(p'p)^{-}p'u/n) = O_p(\epsilon_n^2). \]

Proof: Let $W = P(P'P)^{-}P'$ and $\hat{W} = p(p'p)^{-}p'$ be the orthogonal projection operators
for the linear spaces spanned by the columns of P and p respectively. Since the
space spanned by p is a subset of the space spanned by P, $W - \hat{W}$ is positive
semi-definite. Let $\hat{Z} = P'P/n$. Then by Lemma A.5,

\[ tr(u'\hat{W}u/n) \le tr(u'Wu/n) = \|\hat{Z}^{-1/2} P'u/n\|^2 = O_p(\epsilon_n^2). \quad \text{QED.} \]
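
The key step in this proof is that projecting on a smaller column space cannot increase the quadratic form. The sketch below checks that inequality on arbitrary matrices; the function names are hypothetical, and the Moore-Penrose inverse is used as one convenient choice of generalized inverse.

```python
import numpy as np


def projector(m):
    """Orthogonal projector onto the column space of m, i.e. m (m'm)^- m'."""
    return m @ np.linalg.pinv(m)


def projection_traces(P, A, u):
    """Return (tr(u' W_hat u), tr(u' W u)), where W projects on P and W_hat on p = P A."""
    p = P @ A
    W, W_hat = projector(P), projector(p)
    return np.trace(u.T @ W_hat @ u), np.trace(u.T @ W @ u)


rng = np.random.default_rng(0)
P = rng.standard_normal((50, 6))
A = rng.standard_normal((6, 3))
u = rng.standard_normal((50, 2))
small, large = projection_traces(P, A, u)
print(small, large)  # the first value should not exceed the second
```

When A is square and nonsingular the two spaces coincide and the two traces are equal.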

Let Y and G denote random matrices with the same number of columns and n
rows, and let u = Y-G. For a matrix p let $\hat{\pi} = (p'p)^{-}p'Y$ and $\hat{G} = p\hat{\pi}$.

Lemma A.7: If $tr(u'p(p'p)^{-}p'u/n) = O_p(\epsilon_n^2)$, then for any conformable matrix $\pi$,

\[ \|\hat{G} - G\|^2/n \le O_p(\epsilon_n^2) + \|G - p\pi\|^2/n. \]

Proof: For W and $\hat{W}$ as in the proof of Lemma A.6, by $\hat{W}p = p$, and $I - \hat{W}$ idempotent,
\[ \|\hat{G} - G\|^2/n = tr[Y'\hat{W}Y - Y'\hat{W}G - G'\hat{W}Y + G'G]/n = tr[u'\hat{W}u + G'(I - \hat{W})G]/n \]
\[ \le tr[u'\hat{W}u + (G - p\pi)'(I - \hat{W})(G - p\pi)]/n \le O_p(\epsilon_n^2) + \|G - p\pi\|^2/n. \quad \text{QED} \]
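
The decomposition used in this proof splits the squared distance between fitted and true values into a projected-noise term and an approximation term. The following sketch verifies the identity and the resulting bound numerically on arbitrary inputs; the function name and the simulated matrices are assumptions for illustration.

```python
import numpy as np


def fit_decomposition(Y, G, p):
    """Return (||G_hat - G||^2, tr(u'W u), tr(G'(I - W)G)) for G_hat = W Y and u = Y - G."""
    W = p @ np.linalg.pinv(p)  # orthogonal projector onto the columns of p
    u = Y - G
    G_hat = W @ Y
    total = np.sum((G_hat - G) ** 2)
    noise = np.trace(u.T @ W @ u)
    approx = np.trace(G.T @ (np.eye(len(G)) - W) @ G)
    return total, noise, approx


rng = np.random.default_rng(0)
n = 60
p = rng.standard_normal((n, 4))
G = rng.standard_normal((n, 2))
Y = G + rng.standard_normal((n, 2))
total, noise, approx = fit_decomposition(Y, G, p)
pi = rng.standard_normal((4, 2))  # any conformable coefficient matrix
print(total, noise + approx, np.sum((G - p @ pi) ** 2))
```

The first two printed values agree, and the approximation term is never larger than the squared distance from G to any element of the column space of p.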

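Lemma A.8: If $\lambda_{\min}(Z) \ge C$, $\|p'p/n - Z\| = o_p(1)$, and $tr[u'p(p'p)^{-}p'u/n] = O_p(\epsilon_n^2)$,
then for any conformable matrix $\pi$,

\[ \|\hat{\pi} - \pi\|^2 \le O_p(\epsilon_n^2) + O_p(1)\|G - p\pi\|^2/n, \]

\[ tr[(\hat{\pi} - \pi)'Z(\hat{\pi} - \pi)] \le O_p(\epsilon_n^2) + O_p(1)\|G - p\pi\|^2/n. \]

Proof: By Lemma A.4, $\lambda_{\min}(p'p/n) \ge C$ w.p.a.1, so $\lambda_{\min}(p'p/n)^{-1} = O_p(1)$. Therefore,
for $\tilde{G} = p\pi$,

\[ \|\hat{\pi} - \pi\|^2 \le \lambda_{\min}(p'p/n)^{-1} tr[(\hat{\pi} - \pi)'(p'p/n)(\hat{\pi} - \pi)] = O_p(1) tr[Y'\hat{W}Y - Y'\hat{W}\tilde{G} - \tilde{G}'\hat{W}Y + \tilde{G}'\tilde{G}]/n \]
\[ \le O_p(1)[tr(u'\hat{W}u/n) + \|G - \tilde{G}\|^2/n] = O_p(\epsilon_n^2) + O_p(1)\|G - \tilde{G}\|^2/n. \]

To prove the second conclusion, note that by the triangle inequality and the
same arguments as for the previous equation,

\[ tr[(\hat{\pi} - \pi)'Z(\hat{\pi} - \pi)] = tr[(\hat{\pi} - \pi)'[Z - p'p/n](\hat{\pi} - \pi)] + tr[(\hat{\pi} - \pi)'(p'p/n)(\hat{\pi} - \pi)] \]
\[ \le \|\hat{\pi} - \pi\|^2 \|Z - p'p/n\| + O_p(\epsilon_n^2) + O_p(1)\|G - \tilde{G}\|^2/n = O_p(\epsilon_n^2) + O_p(1)\|G - \tilde{G}\|^2/n. \quad \text{QED.} \]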

Lemma A.9: If $z_1, ..., z_n$ are i.i.d. then for any vector of functions $a^K(z) =
(a_{1K}(z), ..., a_{KK}(z))'$ and K = K(n),

\[ \|\sum_{i=1}^{n} a^{K(n)}(z_i)/n - E[a^{K(n)}(z)]\| = O_p(\{E[a^{K(n)}(z)'a^{K(n)}(z)]/n\}^{1/2}). \]

Proof: Let K = K(n). By the Cauchy-Schwarz inequality,

\[ E[\|\sum_{i=1}^{n} a^K(z_i)/n - E[a^K(z)]\|] \le \{E[\|\sum_{i=1}^{n} a^K(z_i)/n - E[a^K(z)]\|^2]\}^{1/2} \le \{E[\|a^K(z)\|^2]/n\}^{1/2}, \]

so the conclusion follows by the Markov inequality. QED.
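
A small Monte Carlo experiment can illustrate the second-moment bound behind this lemma. The sketch below assumes z uniform on [0, 1] and $a^K(z) = (z, z^2, ..., z^K)'$, for which the population moments are available in closed form; these choices and the function name are illustrative assumptions, not part of the lemma.

```python
import numpy as np


def simulate_mean_error(n, k, reps, seed=0):
    """Monte Carlo estimate of E||sample mean - E[a(z)]||^2 and the bound E||a(z)||^2 / n.

    Uses z ~ Uniform(0, 1) and a(z) = (z, z^2, ..., z^k), so that
    E[z^j] = 1/(j+1) and E[z^(2j)] = 1/(2j+1).
    """
    rng = np.random.default_rng(seed)
    powers = np.arange(1, k + 1)
    true_mean = 1.0 / (powers + 1.0)
    z = rng.uniform(size=(reps, n))
    a = z[:, :, None] ** powers                # shape (reps, n, k)
    err = a.mean(axis=1) - true_mean           # shape (reps, k)
    mse = np.mean(np.sum(err ** 2, axis=1))
    bound = np.sum(1.0 / (2.0 * powers + 1.0)) / n
    return mse, bound


mse, bound = simulate_mean_error(n=200, k=5, reps=2000)
print(mse, bound)  # the simulated mean squared error should fall below the bound
```

The bound is not tight here because it replaces the variance of each element by its second moment.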


Now let $\hat{Z} = \sum_{i=1}^{n} P^K(x_i)P^K(x_i)'/n$ and $Z = \int P^K(x)P^K(x)'dF(x)$.


Lemma A.10: Suppose that Assumptions 3.1 - 3.3 are satisfied. If Assumption 3.3 a) is
also satisfied,

\[ \|\hat{Z} - Z\| = O_p([\sum_{\underline{K} \le K \le \bar{K}} \zeta_0(K)^4/n]^{1/2}) = o_p(1). \]

If Assumption 3.3 b) is also satisfied then

\[ \|\hat{Z} - Z\| = O_p([\zeta_0(\bar{K})^4/n]^{1/2}) = o_p(1). \]
p      0  p 

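Proof: Let $\hat{Z}_K = \sum_{i=1}^{n} P^K(x_i)P^K(x_i)'/n$ and $Z_K = \int P^K(x)P^K(x)'dF(x)$. To show the first
conclusion, note that by the Cauchy-Schwarz inequality, for $a^{K^2}(z) = P^K(x) \otimes P^K(x)$,

\[ E[\max_{\underline{K} \le K \le \bar{K}} \|\hat{Z}_K - Z_K\|] \le \{\sum_{\underline{K} \le K \le \bar{K}} E[\|\hat{Z}_K - Z_K\|^2]\}^{1/2} \]
\[ \le \{\sum_{\underline{K} \le K \le \bar{K}} E[\|a^{K^2}(z)\|^2]/n\}^{1/2} = \{\sum_{\underline{K} \le K \le \bar{K}} E[\|P^K(x)\|^4]/n\}^{1/2} \le [\sum_{\underline{K} \le K \le \bar{K}} \zeta_0(K)^4/n]^{1/2}. \]

Then by the Markov inequality, $\max_{\underline{K} \le K \le \bar{K}} \|\hat{Z}_K - Z_K\| = O_p([\sum_{\underline{K} \le K \le \bar{K}} \zeta_0(K)^4/n]^{1/2})$. The first
conclusion then follows by $\|\hat{Z} - Z\| \le \max_{\underline{K} \le K \le \bar{K}} \|\hat{Z}_K - Z_K\|$ w.p.a.1. To show the second
conclusion, note that w.p.a.1, $\hat{Z}$ and Z are submatrices of $\hat{Z}_{\bar{K}}$ and $Z_{\bar{K}}$ respectively,
whence $\|\hat{Z} - Z\| \le \|\hat{Z}_{\bar{K}} - Z_{\bar{K}}\|$. The conclusion then follows from Lemma A.9. QED.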

Let $y = (y_1, ..., y_n)'$, $g = (g_0(x_1), ..., g_0(x_n))'$, and $p = [p^K(x_1), ..., p^K(x_n)]'$.

Lemma A.11: If Assumptions 3.1 - 3.3 are satisfied, then
\[ (y-g)'p(p'p)^{-}p'(y-g)/n = O_p(\bar{K}/n). \]

Proof: Let $u = y - g$, $P_i = P^{\bar{K}}(x_i)$, $P = [P_1, ..., P_n]'$, and $Z = E[P'P]/n$. By Assumption
3.3, there is a random matrix A such that p = PA. Also, by Lemma A.9 and an argument
like that of the proof of Lemma A.10, $\|P'P/n - Z\| \to 0$ in probability. Also, by Assumption 3.1,
$\lambda_{\min}(Z) \ge C$. Also, $E[P_i u_i] = 0$ by each element of $P_i$ in $\mathcal{G}$, and $E[P_i P_i' u_i^2] =
E[P_i P_i' E[u_i^2|x_i]] \le CZ$, so by the data i.i.d.,

\[ E[\|Z^{-1/2}P'u/n\|^2] = E[tr(u'PZ^{-1}P'u)]/n^2 = tr(Z^{-1/2}(\sum_{i=1}^{n}\sum_{j=1}^{n} E[P_i u_i P_j' u_j])Z^{-1/2})/n^2 \]
\[ = tr(Z^{-1/2} E[P_i P_i' u_i^2] Z^{-1/2})/n \le tr(C I_{\bar{K}})/n \le C\bar{K}/n. \]

Therefore, by the Markov inequality, $\|Z^{-1/2}P'u/n\| = O_p((\bar{K}/n)^{1/2})$. The conclusion then
follows by Lemma A.6. QED.

The next few lemmas give approximation rate results for power series and splines.

Lemma A.12: For power series, if the support $\mathcal{X}$ of x is a compact box in $R^r$ and
f(x) is continuously differentiable of order s, then there are $\alpha$, C > 0 such that
for each K there is $\pi$ with $|f - p^{K\prime}\pi|_d < CK^{-\alpha}$, where $\alpha = s/r$ when d = 0,
and $\alpha = s - d$ when r = 1 and $d \ge 1$.


Proof: For the first conclusion, note that by $|\lambda(K)|$ monotonic increasing, the set of
all linear combinations of $p^K(x)$ will include the set of all polynomials of degree
$CK^{1/r}$ for some C small enough, so Theorem 8 of Lorentz (1986) applies. For r = 1,
note that $\partial^d p^{K+d}(x)/\partial x^d$ is a spanning vector for power series up to order K. By the
first conclusion, there exists C such that for all K there is $\pi$ such that, for
$f_K(x) = p^{K+d}(x)'\pi$, it is the case that $\sup_x |\partial^d f(x)/\partial x^d - \partial^d f_K(x)/\partial x^d| \le C \cdot K^{-s+d}$. The
second conclusion then follows by integration and boundedness of the support. For
example, for d = 1, $\underline{x}$ the minimum of the support, and the constant coefficient chosen
so that $f(\underline{x}) = f_K(\underline{x})$,

\[ |f(x) - f_K(x)| \le \int_{\underline{x}}^{x} |\partial f(x)/\partial x - \partial f_K(x)/\partial x| dx \le CK^{-s+1}. \quad \text{QED} \]


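Lemma A.13: For power series, if $\mathcal{X}$ is star-shaped and there is C such that f(x) is
continuously differentiable of all orders and for all multi-indices $\lambda$, $\max_{x \in \mathcal{X}} |\partial^\lambda f(x)| \le
C^{|\lambda|}$, then for all $\alpha$, d > 0 there is C > 0 such that for all K there is $\pi$ with
$|f - p^{K\prime}\pi|_d \le CK^{-\alpha}$.


Proof: By $\mathcal{X}$ star-shaped, there exists $\bar{x} \in \mathcal{X}$ such that for all $x \in \mathcal{X}$, $\theta x + (1-\theta)\bar{x} \in
\mathcal{X}$ for all $0 \le \theta \le 1$. For a function f(x), let P(f,m,x) denote the Taylor series up
to order m for an expansion around $\bar{x}$. Note $\partial P(f,m,x)/\partial x_j = P(\partial f/\partial x_j, m-1, x)$, so that
by induction $\partial^\lambda P(f,m,x) = P(\partial^\lambda f, m-|\lambda|, x)$. Also, $\partial^\lambda f(x)$ also satisfies the hypotheses,
so that by the intermediate value form of the remainder,

\[ \max_{x \in \mathcal{X}} |\partial^\lambda f(x) - P(\partial^\lambda f, m-|\lambda|, x)| \le C^m/[(m-d)!]. \]

Next, let m(K) be the largest integer such that P(f,m,x) is a linear combination of
$p^K(x)$, and let $f_K(x) = P(f,m(K),x)$. By the "natural ordering" hypothesis, there are
constants $C_1$ and $C_2$ such that $C_1 m(K)^r \le K \le C_2 m(K)^r$, so that for any $\alpha > 0$,
$C^{m(K)}/[(m(K)-d)!] \le CK^{-\alpha}$, and

\[ \sup_{|\lambda| \le d, x \in \mathcal{X}} |\partial^\lambda f(x) - \partial^\lambda f_K(x)| = \sup_{|\lambda| \le d, x \in \mathcal{X}} |\partial^\lambda f(x) - P(\partial^\lambda f, m(K)-|\lambda|, x)| \le CK^{-\alpha}. \quad \text{QED.} \]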

Lemma A.14: For splines, if $\mathcal{X}$ is a compact box and f(x) is continuously
differentiable of order s, then there are $\alpha$, C > 0 such that for all K there is $\pi$
with $|f - p^{K\prime}\pi|_d < CK^{-\alpha}$, where $\alpha = s - d$ for r = 1 and $d \le m-1$ and $\alpha = s/r$ for d = 0.


Proof: The result for d = 0 follows by Theorem 12.8 of Schumaker (1981). For the other
case, note that $\partial^d p^K(x)/\partial x^d$ is a spanning vector for splines of degree m-d, with knot
spacing bounded by $CK^{-1}$ for K large enough and some C. Therefore, by Powell (1981),
there exists $\pi_K$ such that for $f_K(x) = p^K(x)'\pi_K$, $\sup_x |\partial^d f(x)/\partial x^d - \partial^d f_K(x)/\partial x^d| <
CK^{-s+d}$. The conclusion then follows by integration, similarly to the proof of Lemma
A.12. QED.

The next two Lemmas show that for power series and splines, there exists $P^K(x)$
such that Assumption 3.2 is satisfied, and give explicit bounds on the series and their
derivatives.


Lemma  A.15:     For  power  series,  if  the  support  of    x     is  a  Cartesian  product  of  compact 
intervals,  say  of  unit  intervals,  with  density  bounded  below    CTl,    (xf)  (1-xJ  ,     then 
Assumptions  3.2  and  equation  (3.1)  are  satisfied,  with     CjfJO  s  CK         J       and     P  (x)     is 
a  subvector  of     P       (x)     for  all     K  £  1. 

Proof:     Following  the  definitions  in  Abramowitz  and  Stegun  (1972,  Ch.  22),  let     C  .    («) 

( a] 
denote  the  ultraspherical  polynomial  of  order     k     for  exponent     a,     /is 

n21-2ar(k+2a)/<k!(k+a)[r(a)]2>,     and    p^te)  =  [A^f  1/2C(£}(o:).     Also,  let     <c.(x.)  = 

12        2     1 
(2x  .-x  -x  .  )/(x  .-x .)     and  define 
J     J     J        J     J 

(v+.5) 

pk(x)  -  njW(k>  w- 

K  K 

P   (x)     is  a  nonsingular  combination  of     p   (x)     by  the  "natural  ordering"  assumption  (i.e. 

by      |A(k)|      monotonic  increasing).     Also,  for     P(x)     absolutely  continuous  on     X  = 

r      1    2  r       2  i    "j 

[j._  [x.,x.]     with  pdf  proportional  to  f[._.[(x  .-x.)(x.-x.)]     ,     and  by  the  change  of  there 

is  a  constant     C    with 

„        „  r      iv  +.5)  Iv+.S) 

X    .  (XPK(x)PK(x)'dP(x))  £  X    .  (JV,[p     Jw      (<C.(x.))p    Ju     (<c.(x.))']dP(x))  =  C, 
min  mm      j=l  M         J    J  m         J    J 

where  the  inequality  follows  by     P   (x)     a  subvector  of     ®._,[p      w       (a:.(x.)) 

for     M  =  maxksKIMk)|     and     P^M  =  ^^ V}Alx))- 

Next, by differentiating 22.5.37 of Abramowitz and Stegun (for m there equal to $\nu$
here) and solving, it follows that for $s \le k$, $d^s p_k^{(\nu+.5)}(x)/dx^s = C \cdot C_{k-s}^{(\nu+s+.5)}(x)$, so
that  by  22.14.2  of  Abramowitz  and  Stegun,   for     Mk-s)     as     in  equation  (2.3), 

.5+i>  +2A  , ,  -     - 

\a\vU)\  s  cn'iwuk-.)]      J     J  *  ciMk-.)|/rf-5*,HW  s  ck5+v+2\ 

KK.  J— 1  J 

1/r 
where  the  last  equality  follows  by      |A(k-s)|   a  CK      .     QED. 

Lemma  A.16:     For  splines,  if  Assumption  4.1  is  satisfied  then  Assumptions  3.2  and 
equation  (3.1)  are  satisfied,  with     ^(K)  a  CK(1/2)+d. 

Proof:  First,  consider  the  case  where  x  =  x_  and  let  I  =  n.  [-1,1].  Let  B  .  (<r), 
be  the  B-spline  of  order  m,  for  the  knot  sequence  -1  +  2j/[L+l],  j  =  ....  -1,  0,  -1, 
...     with  left  end-knot     j,     and  let 

P*k(V  S  V2)1/\-m-l,L/V'     (k  "  l 4+m+1'  l  '  '•   •-   r)' 


kIC 


P,.^(x)  =  n.Ll(Mk)>0)P,  -^  „.x(x.). 

,K,   ,        .   K, 


n^V^u^i'V 


Then  existence  of  a  nonsingular  matrix     A     such  that     P   (x)  =  Ap   (x)     for     x  e  I     follows 

by  inclusion  in     p   (x)     of  all  multiplicative  interactions  of  splines  for  components  of 

x     corresponding  to  components  of     h(x)     and  the  usual  basis  result  for  B-splines  (e.g. 

Theorem  19.2  of  Powell,  1981). 

Next,  a  well  known  property  of  B-splines  that  for  all     x,     the  number  of  elements  of 

P  (x)  =  (P.K(x),...,PKK(x))     that  are  nonzero  are  bounded,  uniformly  in    K.     Also,  when 

the  elements  of     x     are  i.i.d.  uniform  random  variables,  and  noting  that 

[2(m+l)/L.](L./2l        r..  (x.)     are  tne  so-called  normalized  B-splines  with  evenly  spaced 

knots,  it  follows  by  the  argument  of  Burman  and  Chen  (1989,  p.  1587)  that  for     P.  ,(x)  = 

(P..(xJ P..        t(x-)),     there  is    C    with    X    .  (I1  P. .  (x)P.  .  (x)'dx)  £  C    for  all 

cl    l  c,L+m+l    c  min      .  c,L        t,L 

positive  integers     L.     Therefore,  the  boundedness  away  from  zero  of  the  smallest 

K  r 

eigenvalue follows by $P^K(x)$ a subvector of $\otimes_{\ell=1}^{r} P_{\ell,L}(x_\ell)$, analogously to the proof of
Lemma A.14. Also, since changing even knot spacing is equivalent to rescaling the
argument of B-splines, $\sup_x |\partial^d B_{jL}(x)/\partial x^d| \le CL^d$, $d \le m$, implying the bounds on
derivatives given in the conclusion. The proof when x is present follows as in the
proof of Lemma 8.4. QED.

Proof of Theorem 3.1: For each K let $\pi$ be that from Assumption 3.4 with d = 0, so
that there is C such that

(A.2)
\[ \sum_{i=1}^{n} [g_0(x_i) - p^K(x_i)'\pi]^2/n \le \sup_{x \in \mathcal{X}} |g_0(x) - p^K(x)'\pi|^2 \le CK^{-2\alpha} = O_p(\underline{K}^{-2\alpha}). \]

Also, by Lemma A.11, the hypothesis of Lemma A.7 is satisfied with $\epsilon_n = (\bar{K}/n)^{1/2}$. The
first conclusion then follows by the conclusion of Lemma A.7.

The second conclusion is proven using Lemma A.8. In the hypotheses of Lemma A.8,
let $Z = \int P^K(x)P^K(x)'dF(x)$ and $p = [P^K(x_1), ..., P^K(x_n)]'$. By Assumption 3.2 and Lemmas
A.10 and A.11, the hypotheses of Lemma A.8 are satisfied with $\epsilon_n = (\bar{K}/n)^{1/2}$. For each
K let $\pi$ be as above, except with $P^K(x)$ replacing $p^K(x)$. Then eq. (A.2) is
satisfied (with $P^K(x)$ replacing $p^K(x)$) and

\[ \int [g_0(x) - P^K(x)'\pi]^2 dF(x) \le \sup_{x \in \mathcal{X}} |g_0(x) - P^K(x)'\pi|^2 = O_p(\underline{K}^{-2\alpha}). \]

Then by the second conclusion of Lemma A.8,

(A.3)
\[ \int [g_0(x) - \hat{g}(x)]^2 dF(x) \le 2\int [g_0(x) - P^K(x)'\pi]^2 dF(x) + 2(\hat{\pi} - \pi)'Z(\hat{\pi} - \pi) \]
\[ \le O_p(\underline{K}^{-2\alpha}) + O_p(\bar{K}/n) + O_p(1)\sum_{i=1}^{n} [g_0(x_i) - P^K(x_i)'\pi]^2/n = O_p(\bar{K}/n + \underline{K}^{-2\alpha}). \quad \text{QED.} \]


Proof of Theorem 3.2: Because $P^K(x)$ is a constant nonsingular linear transformation of
$p^K(x)$, Assumption 3.4 will be satisfied for $P^K(x)$ replacing $p^K(x)$. Also, by
Assumption 3.3 b), when $K \ge \underline{K}$, $\pi$ can be chosen so that $|g_0(x) - P^K(x)'\pi|_d \le
|g_0(x) - P^{\underline{K}}(x)'\underline{\pi}|_d$, so that $|g_0(x) - P^K(x)'\pi|_d = O_p(\underline{K}^{-\alpha})$. Also, it follows as in the proof
of Theorem 3.1 that eq. (A.2) and the hypotheses of Lemma A.8 are satisfied. Then by the
first conclusion of Lemma A.8 and the triangle inequality,

(A.4)
\[ |g_0 - \hat{g}|_d \le |g_0 - P^{K\prime}\pi|_d + |P^{K\prime}(\hat{\pi} - \pi)|_d \le O_p(\underline{K}^{-\alpha}) + \zeta_d(K)\|\hat{\pi} - \pi\| \]
\[ = O_p(\underline{K}^{-\alpha}) + \zeta_d(\bar{K}) O_p((\bar{K}/n)^{1/2} + \underline{K}^{-\alpha}) = O_p(\zeta_d(\bar{K})[(\bar{K}/n)^{1/2} + \underline{K}^{-\alpha}]). \quad \text{QED.} \]
pap  pa  ~~ 

Proof of Theorem 4.1: By Assumption 4.1 it follows that the hypotheses of Lemma A.3 are
satisfied. Therefore, by the conclusion of Lemma A.3 there exists a representation
$g_0(x) = \sum_\ell g_{0\ell}(x_\ell)$, where for each $\ell$ the dimension of $x_\ell$ is less than or equal to r.
Then by Lemmas A.12 and A.14 it follows that for each K there is $\pi_\ell$ with
$|g_{0\ell} - p^{K\prime}\pi_\ell|_0 \le CK^{-s/r}$. Then by the triangle inequality, for $\pi = \sum_\ell \pi_\ell$, Assumption 3.4 is satisfied
with d = 0 and $\alpha = s/r$. Also, by Lemma A.15, Assumption 3.2 and equation (3.1) are
satisfied, and Assumption 4.2 implies that Assumption 3.3 holds. Then the conclusion
follows by Theorem 3.1 with d = 0 and $\zeta_0(K) = K$ for power series and $\zeta_0(K) = K^{1/2}$
for splines. QED.

Proof of Theorem 4.2: It follows as in the proof of Theorem 4.1 that Assumptions 3.1 -
3.4 are satisfied, with $\zeta_0(K) = K$ for power series and $\zeta_0(K) = K^{1/2}$ for splines. The
conclusion then follows by Theorem 3.2. QED.

Proof of Theorem 4.3: Follows as in the proof of Theorem 4.2, except that Assumption 3.4
is now satisfied with $\alpha = s - d$ by Lemmas A.12 and A.14, and Assumption 3.2 and equation
(3.1) are now satisfied with $\zeta_d(K) = K^{1+2d}$ for power series, by Lemma A.15, and with
$\zeta_d(K) = K^{(1/2)+d}$ for splines, by Lemma A.16. QED.

Proof of Theorem 4.4: Follows as in the proof of Theorem 4.3, except that Lemma A.13 is
applied to show that Assumption 3.4 holds for any $\alpha > 0$. QED.

Proof of Theorem 5.1: The proof is similar to that of Theorems 4.1 and 4.2. By w
bounded and Lemmas A.12 and A.14, Assumption 3.4 is satisfied with $\alpha = s/r$. Also, note
that Assumption 4.1 is satisfied with u replacing x. Let $P^K(x) = w \otimes P^{K/L}(u)$ for
$P^{K/L}(u)$ equal to the vector from the conclusion of Lemmas A.15 or A.16, for power series
and splines respectively. Then by the smallest eigenvalue of $E[ww'|u]$ bounded away
from zero, $E[P^K(x)P^K(x)'] \ge C(I \otimes E[P^{K/L}(u)P^{K/L}(u)'])$ in the positive semi-definite
sense, so the smallest eigenvalue of $E[P^K(x)P^K(x)']$ is bounded away from zero. Also,
bounds on elements of $P^K(x)$ are the same, up to a constant multiple, as bounds on
elements of $P^{K/L}(u)$, so that Assumption 3.3 will hold. The conclusion then follows by
the conclusions to Theorems 3.1 and 3.2. QED.